Actor Features

Before jumping into modeling, it is best to perform exploratory data analysis, which helps us better understand our dataset; the better we understand the data, the better our decisions about how to model the phenomenon. Since we are trying to predict Oscar nominations with this model, we explored how the variables we’ve collected relate to Oscar nomination.

We started by collecting summary statistics and checking for missingness in our dataset. We visualize some of the summary trends below. As for missingness, we found that quite a few variables had some missing values. In most cases, missingness arose because information from The Movie Database (TMDB) could not be matched to an actor. For example, Peter O’Toole, who was nominated for an Oscar in 2006, had no information matched in TMDB, because the first match for ‘Peter O’Toole’ in TMDB is a Peter O’Toole who works in lighting. In the future we would like to refine this dataset by performing more manual checks to account for missingness and augment the existing data, but for now, given the apparently random nature of the missingness, we filter out observations with missing values when necessary.
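A minimal sketch of this filtering step (not the authors' code; the record fields are hypothetical):

```python
# Hypothetical sketch of the missingness filter described above:
# drop any actor record with an unmatched (None) TMDB field.
records = [
    {"actor": "Actor A", "birth_year": 1962, "gender": "F"},
    {"actor": "Actor B", "birth_year": None, "gender": "M"},  # no TMDB match
]

complete = [r for r in records if all(v is not None for v in r.values())]
```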

To start, we looked at the birth year of actors in our dataset by nomination. We see different patterns in the distribution of birth years for nominees and non-nominees. Non-nominees are more likely to be born in later years, suggesting that, on average, commercially successful movies cast younger actors. This pattern may also suggest that actors don’t reach critical acclaim until they are older and more established in their careers.

Next, we looked into differences in the gender of actors by nomination. The number of male and female nominees is identical, which makes sense given that The Academy’s acting awards split along a gender binary: ten female and ten male actors are nominated each year. It is surprising, on the other hand, that there are far more male than female actors in commercially successful movies.

We also looked at the nationality of actors by nomination. We find that American actors are more likely to be nominated for an Academy Award than non-Americans. While this trend is likely indicative of patterns in Hollywood, it also reflects how we defined commercial success. When we pulled commercially successful movies from TMDB, we did not limit the results to a specific region. So, our dataset includes many actors who starred in internationally successful movies that may not have been popular with American audiences or critics. In the future we may consider limiting the scope of how we define commercial success to improve model performance. That said, including more international stars in our analysis also demonstrates the blind spots Hollywood critics have toward non-Western films and actors.

Finally, we looked at the age at nomination for Oscar-nominated actors by gender. This summary demonstrates the potential interaction effects of some of our variables, in this case, age and gender. We see that female nominees are more likely to be younger, peaking in their mid-30s. Women also experience a sharp dropoff in nominations in their mid-40s. Men, on the other hand, reach their peak years in their early 40s, on average, and experience a more gradual dropoff in their careers.

Predictive Modeling

In addition to the features explored earlier, we derived features from each actor’s search-interest time series for each calendar year. The outcome is nomination at the next year’s award ceremony, but for simplicity we describe it as occurring within the same year. We tested using longer time series, for example two years (which would also capture buzz generated while a movie was in production), but ran into issues with sparsity. Some actors were not prominent for several years before blowing up in popularity (i.e., several consecutive months with no interest), which can make feature extraction impossible. So, we filtered out observations with no variation in search interest.
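The "no variation" filter can be sketched as follows (an illustrative reconstruction, not the authors' code; actor names and values are made up):

```python
# Drop any actor-year whose monthly search-interest series is constant
# (e.g., all zeros), since features like autocorrelation are undefined
# or meaningless for a flat series.
series_by_actor_year = {
    ("Actor A", 2019): [3, 5, 12, 40, 22, 9, 4, 2, 1, 1, 0, 0],
    ("Actor B", 2019): [0] * 12,  # no interest at all: filtered out
}

usable = {
    key: s
    for key, s in series_by_actor_year.items()
    if len(set(s)) > 1  # keep only series with some variation
}
```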

Deriving time series features involves summarizing the properties of a time series, and we used the tsfeatures R package to accomplish this. tsfeatures extracts variables like linearity, curvature, trend, entropy, autocorrelation, and spikiness.² In addition to the extracted features, we also engineered a summary feature of our own: max_spike_height. This captures the maximum search interest within a calendar year, showing whether an actor reached the relative peak of their search interest during that period.
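Our custom feature is much simpler than the tsfeatures extractions. A sketch of max_spike_height on made-up monthly data, assuming Google Trends' 0-100 scaling:

```python
def max_spike_height(interest):
    """Peak search interest over one actor's calendar-year series.

    Under Google Trends' 0-100 normalization, a value of 100 means the
    actor hit their within-window peak during this year.
    """
    return max(interest)

# Made-up monthly interest for one actor-year.
year_2016 = [10, 12, 35, 100, 60, 30, 18, 11, 9, 8, 7, 6]
peak = max_spike_height(year_2016)
```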

Model Feature Comparison

Feature                Model 1 - All Predictors   Model 2 - TS Predictors Only
trend                  ✓                          ✓
spike                  ✓                          ✓
linearity              ✓                          ✓
curvature              ✓                          ✓
e_acf1                 ✓                          ✓
e_acf10                ✓                          ✓
entropy                ✓                          ✓
x_acf1                 ✓                          ✓
x_acf10                ✓                          ✓
diff1_acf1             ✓                          ✓
diff1_acf10            ✓                          ✓
diff2_acf1             ✓                          ✓
max_spike_height       ✓                          ✓
nominated_previously   ✓
age                    ✓
gender                 ✓
american               ✓
cluster                ✓                          ✓
won_previously         ✓

To test our hypothesis about the role of Oscar buzz in predicting nominations, we wanted to build two models. The first model would include all of the features we had engineered, and the second would include only features derived from Google Trends data. Comparing how these two models perform would help us understand the explanatory power of Oscar hype.

Ours is a classification problem: in a given year, will an actor be nominated for an Oscar? After testing a few different modeling strategies, including logistic regression, we decided to use Random Forest. Random Forest has many strengths as a machine learning model, but the one that was arguably most important for our purposes is its ability to handle high-dimensional, noisy data. The first hurdle to overcome was our large class imbalance. Our initial models were not very accurate due to the relatively low occurrence of nominees: in a given year only 20 actors are nominees, and several have been nominated multiple times, which doesn’t yield many observations for training the model. So we decided to undersample non-nominees to achieve a balanced training dataset. Once we started seeing more accurate predictions, we added 5-fold cross-validation to cover potential blind spots introduced by undersampling and to provide more stable evaluation metrics. Cross-validation allows us to be more certain about our model’s performance.
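The undersampling step can be sketched like this (an illustrative reconstruction, not the authors' code; the label field and counts are hypothetical):

```python
import random

# Sample the majority class (non-nominees) down to the size of the
# minority class (nominees) so the training set is balanced. Model
# fitting and 5-fold cross-validation would happen downstream.
def undersample(rows, label_key="nominated", seed=42):
    pos = [r for r in rows if r[label_key]]
    neg = [r for r in rows if not r[label_key]]
    rng = random.Random(seed)
    neg_sampled = rng.sample(neg, k=len(pos))
    balanced = pos + neg_sampled
    rng.shuffle(balanced)
    return balanced

rows = [{"nominated": True}] * 20 + [{"nominated": False}] * 1000
train = undersample(rows)
```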

Model Accuracy - All Predictors

Metric     Mean   n Folds   Standard Error
accuracy   0.73   5         0.01
roc_auc    0.81   5         0.01

Model Accuracy - TS Predictors Only

Metric     Mean   n Folds   Standard Error
accuracy   0.72   5         0.01
roc_auc    0.80   5         0.02

According to basic metrics of model accuracy, the two models performed extremely similarly. Both have an accuracy of around 70% and an ROC AUC of around 0.8, which is much better than a coin flip but not remarkably excellent. The low standard errors on our accuracy metrics suggest that these estimates are reliable. Model 1 (All Predictors) slightly outperformed, but the results still suggest that Oscar buzz is a powerful explainer of Oscar nomination.

Confusion Matrix Model Comparison

Predicted Label   True Label   All Predictors   TS Predictors Only
0                 0            1204             1255
1                 0            432              390
0                 1            17               23
1                 1            49               34

Looking deeper into Type I (false positive) and Type II (false negative) errors for the final Random Forest models, we find that Model 2 (TS Predictors Only) makes fewer Type I errors but more Type II errors than Model 1.

Model Performance Comparison

Metric         All Predictors   TS Predictors Only
f_meas         0.179            0.141
sens           0.742            0.596
bal_accuracy   0.739            0.680

Although Model 2 makes fewer Type I errors, it is far less balanced than Model 1: Model 1 has higher sensitivity and a better F1 score. Again, neither model is excellent, but both fare okay on these metrics. Looking deeper into the misclassified actors may help us understand how to engineer better features for prediction.
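As a consistency check, Model 1's reported metrics can be recovered directly from its confusion-matrix counts (49 true positives, 17 false negatives, 432 false positives, 1204 true negatives):

```python
# Model 1 (All Predictors) confusion-matrix counts.
tp, fn, fp, tn = 49, 17, 432, 1204

sens = tp / (tp + fn)                      # sensitivity (recall)
spec = tn / (tn + fp)                      # specificity
prec = tp / (tp + fp)                      # precision
f_meas = 2 * prec * sens / (prec + sens)   # F1 score
bal_accuracy = (sens + spec) / 2           # balanced accuracy
```

Rounded to three decimals, these reproduce the table above: sens 0.742, f_meas 0.179, bal_accuracy 0.739.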

Who are the Dark Horse Nominees?

All Predictors          TS Predictors Only
Ryan Gosling (2006)     Matt Dillon (2005)
Edward Norton (2024)    Mark Ruffalo (2015)
Johnny Depp (2004)      Colin Farrell (2022)
Daniel Kaluuya (2017)   Sam Elliott (2018)
Daniel Kaluuya (2020)   Robert De Niro (2023)

False negatives can be thought of as Dark Horse nominees: these are the actual nominees the model most confidently (and incorrectly) scored as non-nominees. Interestingly, they are overwhelmingly male. This suggests that adding additional information to the modeling strategy, such as the fact that there can be only 10 male and 10 female nominees per year, might improve our ability to detect Dark Horses.
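One way to encode that constraint (a hypothetical post-processing step, not something we implemented): instead of thresholding probabilities, keep only the top scorers per gender within each year.

```python
def top_k_per_gender(scored, k=10):
    """Pick the k highest-probability actors per gender for one year.

    scored: list of (name, gender, nominee_probability) tuples.
    """
    picks = []
    for g in ("M", "F"):
        group = sorted(
            (s for s in scored if s[1] == g),
            key=lambda s: s[2],
            reverse=True,
        )
        picks.extend(group[:k])
    return picks

# Toy example with k=1 for brevity.
scored = [("A", "M", 0.9), ("B", "M", 0.4), ("C", "F", 0.8)]
picks = top_k_per_gender(scored, k=1)
```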

Who are the Oscar Snubs?

All Predictors              TS Predictors Only
Michelle Rodriguez (2005)   Emma Thompson (2013)
Hugh Grant (2022)           Clint Eastwood (2018)
Mary J. Blige (2009)        Julia Roberts (2018)
Harris Dickinson (2023)     Annette Bening (2016)
Danny Huston (2021)         Kristen Stewart (2020)

False positives can be thought of as Oscar snubs: these are the actors the model most confidently (and incorrectly) scored as nominees. In this area, adding constraints so that flagged actors must have appeared in a significant film that year (rather than, say, starring in a popular TV show or being in the news for a scandal) could improve our ability to detect quality Oscar snubs.

Diving into feature importance, we see additional evidence that Google Trends-derived predictors are the most influential in our models. The only non-time-series variable with notable influence is age, which tracks with our findings from exploratory data analysis. Interestingly, previous Oscar wins and nominations are not very influential, which suggests that additional feature engineering to capture actor quality might not be fruitful.

Within Model 2’s predictors, our engineered cluster feature is not very influential either, which again suggests that the clustering approach could use fine-tuning to achieve more predictive power.

Because correlated features can wash each other out in feature importance measures, we looked at the correlations among numeric features to determine whether we were overlooking any important patterns. Linearity and curvature jump out as highly influential predictors without many strong correlations, suggesting that their influence is accurately portrayed. Many of the time series autocorrelation features are correlated with one another, so it may be of interest to weed out some of these predictors.

As discussed, there are several potential improvements we could make to our modeling strategy. Another consideration is to subset the Google Trends data to include only the years in which actors were actively involved in significant projects. In other words, rather than using all available trend data for each actor across all years, it may be more meaningful to focus on periods when they were actually working on major films. This would allow us to test whether the model can predict Oscar nominees specifically among actors with prominent, high-profile roles, rather than from the broader set of all actors regardless of activity.

In any case, the model as it stands lends evidence to the conclusion that Oscar buzz is a real phenomenon we can see in the data, and that it has some predictive power in determining whether or not an actor will be nominated.


  1. L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley, 1990.↩︎

  2. R. Hyndman, Y. Kang, P. Montero-Manso, M. O’Hara-Wild, T. Talagala, E. Wang, and Y. Yang, “tsfeatures: Time Series Feature Extraction (Version 1.1.1).” Accessed: May 4, 2025. [Online]. Available: https://pkg.robjhyndman.com/tsfeatures/↩︎